Chapter 6 Exercises
These are intended to be done after completing the worked examples.
6.1 Exercise 1 — https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE119732
Using GSE119732,
- Map the distribution of the samples as a box plot and a density plot using only the raw data.
First, load required libraries, and load the data.
## Loading required package: Biobase
## Loading required package: BiocGenerics
## Loading required package: generics
##
## Attaching package: 'generics'
## The following objects are masked from 'package:base':
##
## as.difftime, as.factor, as.ordered, intersect, is.element, setdiff,
## setequal, union
##
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:stats':
##
## IQR, mad, sd, var, xtabs
## The following objects are masked from 'package:base':
##
## anyDuplicated, aperm, append, as.data.frame, basename, cbind,
## colnames, dirname, do.call, duplicated, eval, evalq, Filter, Find,
## get, grep, grepl, is.unsorted, lapply, Map, mapply, match, mget,
## order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
## rbind, Reduce, rownames, sapply, saveRDS, table, tapply, unique,
## unsplit, which.max, which.min
## Welcome to Bioconductor
##
## Vignettes contain introductory material; view with
## 'browseVignettes()'. To cite Bioconductor, see
## 'citation("Biobase")', and for packages 'citation("pkgname")'.
## Setting options('download.file.method.GEOquery'='auto')
## Setting options('GEOquery.inmemory.gpl'=FALSE)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:Biobase':
##
## combine
## The following objects are masked from 'package:BiocGenerics':
##
## combine, intersect, setdiff, setequal, union
## The following object is masked from 'package:generics':
##
## explain
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
## Loading required package: tidyverse
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats 1.0.1 ✔ readr 2.1.6
## ✔ ggplot2 4.0.1 ✔ stringr 1.6.0
## ✔ lubridate 1.9.5 ✔ tidyr 1.3.2
## ✔ purrr 1.2.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::combine() masks Biobase::combine(), BiocGenerics::combine()
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ✖ ggplot2::Position() masks BiocGenerics::Position(), base::Position()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
## Using locally cached version of supplementary file(s) GSE119732 found here:
## data/GSE119732/GSE119732_count_table_RNA_seq.txt.gz
path1 <- file.path("data", gse1)
files1 <- list.files(path1, pattern = "\\.txt.gz$|\\.tsv.gz$|\\.csv.gz$",
full.names = TRUE, recursive = TRUE)
files1## [1] "data/GSE119732/GSE119732_count_table_RNA_seq.txt.gz"
The file is a .txt file. Using the raw data, we can map the distributions of the samples as a box plot.
##
## ── Column specification ────────────────────────────────────────────────────────
## cols(
## .default = col_double(),
## gene_id = col_character()
## )
## ℹ Use `spec()` for the full column specifications.

Secondly, we can also map the distributions of the samples as a density plot.

- Add colours to the plot separating the different variables in the model
The density plot already has colours (?).
- Filter out lowly expressed genes and re plot the distributions.
First, we convert the counts to Counts Per Million (CPM). Then, we filter out lowly expressed genes.
## Loading required package: limma
##
## Attaching package: 'limma'
## The following object is masked from 'package:BiocGenerics':
##
## plotMA
# convert to CPM
gse1_cpm <- cpm(y = gse1_data[,2:ncol(gse1_data)])
# remove lowly expressed genes
to_remove <- edgeR::filterByExpr(gse1_cpm,min.count = 3)## No group or design set. Assuming all samples belong to one group.
Now, we replot the samples’ distribution.


6.2 Exercise 2 — https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE122380
Using GSE122380, confirm whether the ID column contains Ensembl IDs with version suffixes.
- Map the distribution of the samples as a box plot and a density plot using only the raw data.
First, load the data.
library(GEOquery)
library(knitr)
library(dplyr)
library(tibble)
if(!require(tidyverse)){
install.packages("tidyverse",dependencies = TRUE)
}
library(tidyverse)
gse2 <- "GSE122380"
source("./supp_functions.R")
fetch_geo_supp(gse = gse2)## Using locally cached version of supplementary file(s) GSE122380 found here:
## data/GSE122380/GSE122380_Supplementary_Data_Table_S1.xlsx
## Using locally cached version of supplementary file(s) GSE122380 found here:
## data/GSE122380/GSE122380_raw_counts.txt.gz
path2 <- file.path("data", gse2)
files2 <- list.files(path2, pattern = "\\.txt.gz$|\\.tsv.gz$|\\.csv.gz$",
full.names = TRUE, recursive = TRUE)
files2## [1] "data/GSE122380/GSE122380_raw_counts.txt.gz"
First we preview the data:
##
## ── Column specification ────────────────────────────────────────────────────────
## cols(
## .default = col_double(),
## Gene_id = col_character()
## )
## ℹ Use `spec()` for the full column specifications.